
    Research on High-performance and Scalable Data Access in Parallel Big Data Computing

    To facilitate big data processing, many dedicated data-intensive storage systems such as the Google File System (GFS), the Hadoop Distributed File System (HDFS), and the Quantcast File System (QFS) have been developed. Currently, HDFS [20] is the state-of-the-art and most popular open-source distributed file system for big data processing. It is widely deployed as the bedrock for many big data processing systems and frameworks, such as the script-based Pig system, MPI-based parallel programs, graph processing systems, and Scala/Java-based Spark frameworks. These systems employ parallel processes/executors to speed up data processing within scale-out clusters. Job or task schedulers in parallel big data applications such as mpiBLAST and ParaView can maximize the usage of computing resources such as memory and CPU by tracking resource consumption and availability for task assignment. However, since these schedulers do not take the distributed I/O resources and global data distribution into consideration, the data requests from parallel processes/executors will unfortunately be served in an imbalanced fashion on the distributed storage servers. These imbalanced access patterns among storage nodes arise for two reasons: (a) unlike conventional parallel file systems, which use striping policies to distribute data evenly among storage nodes, data-intensive file systems such as HDFS store each data unit, referred to as a chunk or block file, as several copies placed by a relatively random policy, which can result in an uneven data distribution among storage nodes; and (b) under the data retrieval policy of HDFS, the more data a storage node contains, the higher the probability that it will be selected to serve the data. Therefore, on the nodes serving multiple chunk files, the data requests from different processes/executors will compete for shared resources such as disk heads and network bandwidth. As a result, the makespan of the entire program can be significantly prolonged and the overall I/O performance degrades. The first part of my dissertation seeks to address aspects of these problems by creating an I/O middleware system and designing matching-based algorithms to optimize data access in parallel big data processing. To address the problem of remote data movement, we develop an I/O middleware system, called SLAM, which allows MPI-based analysis and visualization programs to benefit from locality reads, i.e., each MPI process can access its required data from a local or nearby storage node. This can greatly improve execution performance by reducing the amount of data movement over the network. Furthermore, to address the problem of imbalanced data access, we propose a method called Opass, which models the data read requests issued by parallel applications to cluster nodes as a graph whose edge weights encode load-capacity demands. We then employ matching-based algorithms to map processes to data so that data access is achieved in a balanced fashion. The final part of my dissertation focuses on optimizing sub-dataset analyses in parallel big data processing. Our proposed methods can benefit different analysis applications with various computational requirements, and experiments on different cluster testbeds show their applicability and scalability.
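
    To convey the flavor of this balanced process-to-data mapping, the sketch below greedily sends each requested chunk to the least-loaded node that holds a replica of it. It is only a minimal illustration of the balancing goal described above, not the actual Opass matching algorithm, and every identifier (process, chunk, and node name) is hypothetical.

```python
# Minimal sketch of balanced replica selection (illustrative, not Opass):
# serve each requested chunk from whichever replica holder currently has
# the lightest assigned load, so no storage node becomes a hotspot.
from collections import defaultdict

def balanced_assignment(requests, replica_map):
    """requests: list of (process_id, chunk_id); replica_map: chunk_id -> list of nodes."""
    load = defaultdict(int)   # node -> number of chunks it will serve
    plan = {}                 # (process_id, chunk_id) -> chosen node
    # Handle the most replica-constrained requests first so scarce chunks
    # are not forced onto already-loaded nodes later.
    for pid, chunk in sorted(requests, key=lambda r: len(replica_map[r[1]])):
        node = min(replica_map[chunk], key=lambda n: load[n])
        plan[(pid, chunk)] = node
        load[node] += 1
    return plan, dict(load)

if __name__ == "__main__":
    replicas = {"c1": ["n1", "n2"], "c2": ["n1"], "c3": ["n2", "n3"]}
    requests = [("p0", "c1"), ("p1", "c2"), ("p2", "c3")]
    plan, load = balanced_assignment(requests, replicas)
    print(plan)   # c2 must go to n1, so c1 is pushed to n2 and c3 to n3
    print(load)   # every node ends up serving exactly one chunk
```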

    Optimize Parallel Data Access In Big Data Processing

    In recent years, the Hadoop Distributed File System (HDFS) has been deployed as the bedrock for many parallel big data processing systems, such as graph processing systems, MPI-based parallel programs, and Scala/Java-based Spark frameworks, which can efficiently support iterative and interactive data analysis in memory. The first part of my dissertation focuses on studying parallel data access in distributed file systems, e.g., HDFS. Since the distributed I/O resources and global data distribution are often not taken into consideration, the data requests from parallel processes/executors will unfortunately be served in a remote and imbalanced fashion on the storage servers. In order to address these problems, we develop I/O middleware systems and matching-based algorithms to map parallel data requests to storage servers such that local and balanced data access can be achieved. The last part of my dissertation presents our plans to improve the performance of interactive data access in big data analysis. Specifically, most interactive analysis programs scan through the entire dataset regardless of which data is actually required. We plan to develop a content-aware method to quickly access the required data without this laborious scanning process.
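
    As a rough illustration of the planned content-aware access (the actual design is not described in the abstract), the sketch below keeps a small per-block index of which keys occur in each block, so an interactive query reads only the blocks that can contain its data instead of scanning everything. All names and the record layout are assumptions made for this example.

```python
# Hypothetical content-aware skip index: record which keys (e.g., sub-dataset
# identifiers) occur in each block so a query touches only relevant blocks.
def build_block_index(blocks):
    """blocks: list of blocks, each a list of (key, value) records."""
    return [{key for key, _ in block} for block in blocks]   # keys present per block

def query(blocks, index, wanted_key):
    """Read only blocks whose index entry says wanted_key may be present."""
    hits = []
    for block, keys in zip(blocks, index):
        if wanted_key in keys:                 # irrelevant blocks are skipped entirely
            hits.extend(v for k, v in block if k == wanted_key)
    return hits

if __name__ == "__main__":
    data = [[("nyc", 1), ("sf", 2)], [("sf", 3)], [("nyc", 4)]]
    idx = build_block_index(data)
    print(query(data, idx, "nyc"))   # [1, 4] -- the middle block is never scanned
```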

    Sapprox: Enabling Efficient And Accurate Approximations On Sub-Datasets With Distribution-Aware Online Sampling

    In this paper, we aim to enable both efficient and accurate approximations on arbitrary sub-datasets of a large dataset. Due to the prohibitive storage overhead of caching offline samples for each sub-dataset, existing offline-sample-based systems provide high-accuracy results for only a limited number of sub-datasets, such as the popular ones. On the other hand, current online-sample-based approximation systems, which generate samples at runtime, do not take into account the uneven storage distribution of a sub-dataset. They work well for uniformly distributed sub-datasets but suffer from low sampling efficiency and poor estimation accuracy on unevenly distributed ones. To address this problem, we develop a distribution-aware method called Sapprox. Our idea is to collect the occurrences of a sub-dataset at each logical partition of a dataset (its storage distribution) in the distributed system and to make good use of this information to facilitate online sampling. There are three thrusts in Sapprox. First, we develop a probabilistic map to reduce the exponential number of recorded sub-datasets to a linear one. Second, we apply cluster sampling with unequal probability theory to implement a distribution-aware sampling method for efficient online sub-dataset sampling. Third, we quantitatively derive the optimal sampling unit size in a distributed file system by associating it with approximation cost and accuracy. We have implemented Sapprox in the Hadoop ecosystem as an example system and open-sourced it on GitHub. Our comprehensive experimental results show that Sapprox can achieve a speedup of up to 20x over precise execution.
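
    The following sketch illustrates the general idea of cluster sampling with unequal, occurrence-proportional probabilities, which the abstract names as Sapprox's second thrust. The per-partition occurrence counts stand in for the probabilistic map, and the estimator is the textbook Hansen-Hurwitz form; Sapprox's exact bookkeeping and cost model are not reproduced here.

```python
# Illustrative unequal-probability cluster sampling: partitions are drawn
# with probability proportional to how often the target sub-dataset occurs
# in them, and the Hansen-Hurwitz estimator recovers the population total.
import random

def estimate_total(partitions, occurrences, n_samples, cluster_total):
    """
    partitions:    list of partition ids
    occurrences:   partition id -> occurrence count of the target sub-dataset
    cluster_total: function(partition id) -> exact sum over the sub-dataset's
                   records in that partition (this is where the I/O happens)
    """
    weights = [occurrences[p] for p in partitions]
    total_occ = sum(weights)
    probs = {p: occurrences[p] / total_occ for p in partitions}
    # Draw partitions with probability proportional to occurrence count.
    sampled = random.choices(partitions, weights=weights, k=n_samples)
    # Hansen-Hurwitz estimator of the total over the whole sub-dataset.
    return sum(cluster_total(p) / probs[p] for p in sampled) / n_samples

if __name__ == "__main__":
    parts = ["b0", "b1", "b2"]
    occ = {"b0": 100, "b1": 10, "b2": 1}
    data = {"b0": list(range(100)), "b1": list(range(10)), "b2": [5]}
    est = estimate_total(parts, occ, n_samples=50,
                         cluster_total=lambda p: sum(data[p]))
    print(est)   # unbiased estimate of the true total (5000 here)
```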

    Supervising And Collaborating The Networked-Team By Dcms

    Groups of experts in different fields usually need to collaborate in a distributed way so that solutions can be investigated from multiple views and at multiple levels when analyzing and modeling large-scale, sophisticated systems. Consequently, how to organize, supervise, and evaluate such a distributed modeling networked team has become an important problem. The DCMS (Distributed Cooperative Modeling System) methodology and platform have been put forward to support collaboration within such networked teams. In addition, a Soft-Agents network system is designed to supervise and coordinate the teams and individuals in the networked team based on their working progress, diligence, modeling complexity, and model-updating frequency. Such supervision enables separate judgments on the working characteristics and modeling quality of each team and individual, which in turn provides a basis for future optimization and coordination of the networked-team organization. © 2013 IEEE

    Dl-Mpi: Enabling Data Locality Computation For Mpi-Based Data-Intensive Applications

    Currently, most scientific applications based on MPI adopt a compute-centric architecture: the needed data is accessed by MPI processes running on different nodes through a shared file system. Unfortunately, the explosive growth of scientific data undermines the high performance of MPI-based applications, especially in the execution environment of commodity clusters. In this paper, we present a novel approach to enable data locality computation for MPI-based data-intensive applications, referred to as DL-MPI. DL-MPI allows MPI-based programs to obtain data distribution information for compute nodes through a novel data locality API. In addition, the problem of allocating data processing tasks to parallel processes is formulated as an integer optimization problem with the objectives of achieving data locality computation and optimal parallel execution time. For heterogeneous runtime environments, we propose a probability-based algorithm to dynamically schedule tasks to processes by evaluating the unprocessed local data and the computing ability of each compute node. We demonstrate the functionality of our methods through the implementation of scientific data processing programs as well as the incorporation of DL-MPI into existing HPC applications. © 2013 IEEE
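
    As a hedged sketch of this probability-based scheduling idea, the snippet below hands the next task to a worker with probability that grows with both its remaining unprocessed local data and its computing ability. The multiplicative weighting and the fallback rule are assumptions made for illustration, not DL-MPI's actual formula.

```python
# Illustrative probability-based task dispatch (not DL-MPI's exact policy):
# a worker's chance of receiving the next task grows with its unprocessed
# local data and its relative computing speed.
import random

def pick_worker(workers):
    """workers: dict rank -> {'local_unprocessed': block count, 'speed': relative rate}."""
    ranks = list(workers)
    weights = [workers[r]["local_unprocessed"] * workers[r]["speed"] for r in ranks]
    if sum(weights) == 0:                                  # no local data left anywhere,
        weights = [workers[r]["speed"] for r in ranks]     # so fall back to speed alone
    return random.choices(ranks, weights=weights, k=1)[0]

if __name__ == "__main__":
    cluster = {
        "rank0": {"local_unprocessed": 8, "speed": 1.0},
        "rank1": {"local_unprocessed": 2, "speed": 2.0},
        "rank2": {"local_unprocessed": 0, "speed": 1.5},
    }
    print(pick_worker(cluster))   # rank0 is twice as likely as rank1; rank2 only via the fallback
```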

    On Balance Among Energy, Performance And Recovery In Storage Systems

    With the increasing size of clusters as well as the increasing capacity of each storage node, current storage systems are spending more time on recovery. When a node failure happens, the system enters degradation mode, in which node reconstruction/block recovery is initiated. This process needs to wake up a number of disks and takes a substantial amount of I/O bandwidth, which compromises not only energy efficiency but also performance. This raises a natural problem: how can performance, energy, and recovery be balanced in degradation mode in an energy-efficient storage system? Without considering the I/O bandwidth contention between recovery and foreground performance, current energy-proportional solutions cannot answer this question accurately. This paper presents a mathematical model called Perfect Energy, Reliability, and Performance (PERP), which provides guidelines for provisioning the number of active nodes and the recovery speed at each time slot with respect to the performance and recovery constraints. We apply our model to practical data layouts and test its effectiveness on our 25-node CASS cluster. Experimental results validate that our model helps realize 25% energy savings while meeting both performance and recovery constraints, and the savings are expected to increase with a larger number of nodes.
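
    The abstract does not give PERP's formulas, but the toy calculation below illustrates the underlying tension the model captures: per-slot node provisioning must cover both the client I/O demand and the recovery traffic that shares the same bandwidth. The additive bandwidth split and all numbers are assumptions made for illustration only.

```python
# Toy illustration (not the PERP model itself): each active node's bandwidth
# is shared between client I/O and recovery traffic, so the minimum number
# of active nodes in a time slot depends on both constraints.
import math

def min_active_nodes(demand_mbps, recovery_mbps, node_bw_mbps):
    """Smallest node count whose aggregate bandwidth covers client demand
    plus the recovery traffic that must complete within the slot."""
    needed = demand_mbps + recovery_mbps
    return max(1, math.ceil(needed / node_bw_mbps))

if __name__ == "__main__":
    # 800 MB/s of client reads, 300 MB/s of reconstruction traffic,
    # 100 MB/s of usable bandwidth per storage node.
    print(min_active_nodes(800, 300, 100))   # 11 nodes; 8 would suffice without recovery
```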

    Taming Big Data Svm With Locality-Aware Scheduling

    Incorporating the MPI programming model into data-intensive file systems for big data applications is an important direction for performance optimization research. In this paper we ported an MPI-SVM solver, originally developed for an HPC environment, to the Hadoop Distributed File System (HDFS). We analyzed the performance bottlenecks the SVM solver faces on HDFS. It is known that storage expansion on HDFS comes with a skewed data distribution. As a result, we found that some hot nodes always receive concentrated I/O requests while other nodes must always issue remote requests. These remote requests lengthen I/O delays on the hot nodes, which becomes a performance bottleneck for our solver. We therefore specifically improved the data preprocessing stage, which requires a large amount of I/O operations, with a deterministic scheduling method. Our improvement yielded a balanced read pattern on each node. The time ratio between the longest and shortest processes was reduced by 60%, and the average read time was significantly reduced, by 78%. The data served by each node also showed a small variance in comparison with the originally ported SVM algorithm. We believe our design avoids the overhead introduced by remote I/O operations, which will benefit many algorithms when coping with data at large scale.
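
    As a small, hedged example of what a deterministic schedule can look like (the paper's concrete method is not spelled out in the abstract), the sketch below derives the same read plan on every run by rotating through each block's replica list in block order, spreading requests across replica holders without any runtime randomness.

```python
# Hypothetical deterministic read schedule (illustration only): the plan is
# a pure function of the block-to-replica map, so every run of the
# preprocessing step issues the same, evenly rotated set of reads.
def deterministic_schedule(block_replicas):
    """block_replicas: dict block_id -> ordered list of nodes holding a replica."""
    plan = {}
    for i, block in enumerate(sorted(block_replicas)):
        holders = block_replicas[block]
        plan[block] = holders[i % len(holders)]   # fixed rotation over replica holders
    return plan

if __name__ == "__main__":
    replicas = {"blk-0": ["n1", "n2", "n3"],
                "blk-1": ["n1", "n2", "n3"],
                "blk-2": ["n1", "n2", "n3"],
                "blk-3": ["n1", "n2", "n3"]}
    print(deterministic_schedule(replicas))
    # {'blk-0': 'n1', 'blk-1': 'n2', 'blk-2': 'n3', 'blk-3': 'n1'}
```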

    Accelerating I/O Performance Of Svm On Hdfs

    The Hadoop Distributed File System (HDFS) is a major distributed file system for commodity clusters and cloud computing. Its extensive scalability and replica-based fault tolerance scheme make it well suited for data-intensive applications. Due to the tremendous growth of data, many computation-centric applications are also becoming data-intensive. However, they are not optimal on HDFS, which leaves plenty of room for performance optimization. In this paper we ported an MPI-SVM solver, originally developed for an HPC environment, to HDFS. We specifically improved the data preprocessing stage, which requires a large amount of I/O operations, with a deterministic scheduling method. Our improvement showed a balanced read pattern on each node. The time ratio between the longest and shortest processes was reduced by 60%, and the average read time was significantly reduced, by 78%. The data served by each node also showed a small variance in comparison with the originally ported SVM algorithm. We believe that our design avoids the overhead introduced by remote I/O operations, which will benefit many algorithms when coping with data at large scale.

    G-Sd: Achieving Fast Reverse Lookup Using Scalable Declustering Layout In Large-Scale File Systems

    With the increasing popularity of cloud computing, current data centers contain petabytes of data, which requires thousands or tens of thousands of storage nodes at a single site. Node failure in these data centers is the norm rather than a rare event, so data reliability is a great concern. In order to achieve high reliability, data recovery or node reconstruction is a must. Although extensive research has investigated how to sustain high performance and high reliability in the face of node failure at large scale, the reverse lookup problem, namely finding the list of objects stored on a failed node, is not well addressed. As the first step of failure recovery, this process has a direct impact on data recovery and node reconstruction. Existing solutions use metadata traversal or data-distribution reversing methods for reverse lookup, which are either time-consuming or expensive. Deterministic block placement schemes can achieve fast and efficient reverse lookup easily, but they are designed for centralized, small-scale storage architectures such as RAID and, lacking scalability, cannot be directly applied to large-scale storage systems. In this paper, we propose Group-Shifted Declustering (G-SD), a deterministic data layout for multi-way replication. G-SD addresses the scalability issue of our previous Shifted Declustering layout and supports fast and efficient reverse lookup. Our mathematical proofs demonstrate that G-SD is a scalable layout that maintains a high level of data availability. We implement a prototype of G-SD and its reverse lookup function on two open-source file systems: Ceph and HDFS. Large-scale experiments on the Marmot cluster demonstrate that the average speed of G-SD reverse lookup is more than 5x faster than that of existing schemes.
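
    To show why a deterministic layout makes reverse lookup cheap, the sketch below uses a simple modular placement function whose inverse can be enumerated directly, with no metadata traversal. The formula is a stand-in chosen for illustration; it is not the actual G-SD (or Shifted Declustering) layout.

```python
# Toy deterministic placement with an analytic reverse lookup: the blocks
# held by a failed node are computed straight from the placement formula.
def place(block, replica, num_nodes):
    """Forward mapping: which node stores replica r of a block."""
    return (block + replica) % num_nodes

def reverse_lookup(failed_node, num_blocks, num_replicas, num_nodes):
    """All (block, replica) pairs the failed node held, with no metadata scan."""
    held = []
    for r in range(num_replicas):
        # block + r == failed_node (mod num_nodes)  =>  block == failed_node - r
        first = (failed_node - r) % num_nodes
        held.extend((b, r) for b in range(first, num_blocks, num_nodes))
    return held

if __name__ == "__main__":
    N, K, B = 8, 3, 64            # 8 nodes, 3-way replication, 64 blocks
    lost = reverse_lookup(failed_node=5, num_blocks=B, num_replicas=K, num_nodes=N)
    # Sanity check against the forward mapping:
    assert all(place(b, r, N) == 5 for b, r in lost)
    print(len(lost), "replicas to re-create")   # 24 = 64 blocks * 3 replicas / 8 nodes
```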
